Parsing Social Network Survey Data from Hidden Populations Using Stochastic Context-Free Grammars

Authors

Art F. Y. Poon

Kimberly C. Brouwer

Steffanie A. Strathdee

Michelle Firestone-Cruz

Remedios M. Lozada

Sergei L. Kosakovsky Pond

Douglas D. Heckathorn

Simon D. W. Frost

Abstract

BACKGROUND Human populations are structured by social networks, in which individuals tend to form relationships based on shared attributes. Certain attributes that are ambiguous, stigmatized or illegal can create a "hidden" population, so-called because its members are difficult to identify. Many hidden populations are also at an elevated risk of exposure to infectious diseases. Consequently, public health agencies are presently adopting modern survey techniques that traverse social networks in hidden populations by soliciting individuals to recruit their peers, e.g., respondent-driven sampling (RDS). The concomitant accumulation of network-based epidemiological data, however, is rapidly outpacing the development of computational methods for analysis. Moreover, current analytical models rely on unrealistic assumptions, e.g., that the traversal of social networks can be modeled by a Markov chain rather than a branching process.

METHODOLOGY/PRINCIPAL FINDINGS Here, we develop a new methodology based on stochastic context-free grammars (SCFGs), which are well-suited to modeling the tree-like structure of the RDS recruitment process. We apply this methodology to an RDS case study of injection drug users (IDUs) in Tijuana, México, a hidden population at high risk of blood-borne and sexually-transmitted infections (i.e., HIV, hepatitis C virus, syphilis). Survey data were encoded as text strings that were parsed using our custom implementation of the inside-outside algorithm in a publicly-available software package (HyPhy), which uses either expectation maximization or direct optimization methods and permits constraints on model parameters for hypothesis testing. We identified significant latent variability in the recruitment process that violates assumptions of Markov chain-based methods for RDS analysis: firstly, IDUs tended to emulate the recruitment behavior of their own recruiter; and secondly, the recruitment of like peers (homophily) was dependent on the number of recruits.

CONCLUSIONS SCFGs provide a rich probabilistic language that can articulate complex latent structure in survey data derived from the traversal of social networks. Such structure, which has no representation in Markov chain-based models, can interfere with the estimation of the composition of hidden populations if left unaccounted for, raising critical implications for the prevention and control of infectious disease epidemics.
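
As a rough illustration of what parsing a text string under an SCFG involves, the sketch below computes the inside probability of a short token sequence under a toy grammar in Chomsky normal form. Everything in it is an assumption made for illustration: the function name `inside_probability`, the single nonterminal `S`, the tokens `a` and `b`, and the rule probabilities are hypothetical, and it is neither the grammar used in the paper nor the HyPhy implementation. It covers only the inside pass, not the outside pass or parameter estimation.

```python
from collections import defaultdict


def inside_probability(tokens, unary, binary, start="S"):
    """Probability that `start` derives `tokens` under a CNF grammar.

    unary:  {(lhs, terminal): prob}      rules of the form lhs -> terminal
    binary: {(lhs, left, right): prob}   rules of the form lhs -> left right
    """
    n = len(tokens)
    # chart[(i, j)][A] = probability that nonterminal A derives tokens[i:j]
    chart = defaultdict(lambda: defaultdict(float))

    # Spans of length 1: apply the terminal rules.
    for i, tok in enumerate(tokens):
        for (lhs, terminal), p in unary.items():
            if terminal == tok:
                chart[(i, i + 1)][lhs] += p

    # Longer spans: sum over every split point and every binary rule.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (lhs, left, right), p in binary.items():
                    chart[(i, j)][lhs] += (
                        p * chart[(i, k)][left] * chart[(k, j)][right]
                    )

    return chart[(0, n)][start]


# Hypothetical toy grammar (probabilities for S sum to 1):
#   S -> S S  (0.3)   a branching step
#   S -> 'a'  (0.5)   a leaf of type a
#   S -> 'b'  (0.2)   a leaf of type b
unary = {("S", "a"): 0.5, ("S", "b"): 0.2}
binary = {("S", "S", "S"): 0.3}

print(inside_probability(["a", "b"], unary, binary))
print(inside_probability(["a", "a", "a"], unary, binary))
```

The three-token string has two distinct parse trees under this grammar, and the inside probability sums over both, which is what distinguishes a branching (tree-structured) model from a chain that assigns a single path to each sequence.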

Related articles

Recognition of on-line handwritten mathematical expressions using 2D stochastic context-free grammars and hidden Markov models

This paper describes a formal model for the recognition of on-line handwritten mathematical expressions using 2D stochastic context-free grammars and hidden Markov models. Hidden Markov models are used to recognize mathematical symbols, and a stochastic context-free grammar is used to model the relation between these symbols. This formal model makes it possible to use classic algorithms for parsin...

Training Stochastic Grammars From Unlabelled Text Corpora

The paper describes various aspects and practicalities of applying the "Hidden Markov" approach to train parameters of regular and context-free stochastic grammars. The approach enables grammars to be trained from unlabelled text corpora, providing flexibility in the choice of syntactic categories and text domain. Part-of-speech tagging and parsing are discussed as applications. Linguistic consi...

Recent Advances of Grammatical Inference

In this paper, we provide a survey of recent advances in the field of “Grammatical Inference”, with a particular emphasis on results concerning the learnability of target classes represented by deterministic finite automata, context-free grammars, hidden Markov models, stochastic context-free grammars, simple recurrent neural networks, and case-based representations.

Unlexicalised Hidden Variable Models of Split Dependency Grammars

This paper investigates transforms of split dependency grammars into unlexicalised context-free grammars annotated with hidden symbols. Our best unlexicalised grammar achieves an accuracy of 88% on the Penn Treebank data set, which represents a 50% reduction in error over previously published results on unlexicalised dependency parsing.

Inducing Compact but Accurate Tree-Substitution Grammars

Tree substitution grammars (TSGs) are a compelling alternative to context-free grammars for modelling syntax. However, many popular techniques for estimating weighted TSGs (under the moniker of Data Oriented Parsing) suffer from the problems of inconsistency and over-fitting. We present a theoretically principled model which solves these problems using a Bayesian non-parametric formulation. Our...

Volume 4

Publication date: 2009